Drivers of the decrease of patent similarities from 1976 to 2021

The citation network of patents citing prior art arises from the legal obligation of patent applicants to properly disclose their invention. One way to study the relationship between current patents and their antecedents is by analyzing the similarity between the textual elements of patents. Many patent similarity indicators have shown a constant decrease since the mid-70s. Although several explanations have been proposed, more comprehensive analyses of this phenomenon have been rare. In this paper, we use a computationally efficient measure of patent similarity scores that leverages state-of-the-art Natural Language Processing tools to investigate potential drivers of this apparent similarity decrease. This is achieved by modeling patent similarity scores by means of generalized additive models. We found that non-linear modeling specifications are able to distinguish between distinct, temporally varying drivers of the patent similarity levels that explain more variation in the data (R² ∼ 18%) compared to previous methods. Moreover, the model reveals an underlying trend in similarity scores that is fundamentally different from the one presented previously.


Introduction
Understanding the characteristics of ground-breaking innovations is crucial for technology-based firms striving for success [1]. Patent indicators serve this purpose and support, among other things, the development of product strategies [2,3], the monitoring of existing technological trends [4], the detection of promising investment opportunities [5], the assessment of the technological impact of novel applications [6,7], and the recognition of similar technologies [8][9][10][11].
Patent indicators using institutional classifications and citation information are predominant [6,7,12,13] in patent analysis. Patent classification systems like the International Patent Classification (IPC) are usually processed for identifying patents that are technologically similar. However, technological relatedness may not be fully captured by sharing the same patent class. Despite numerous methods for analyzing technological relatedness and closeness based on such classes [14], their usage can be problematic when patents need to be identified, compared, or matched with similar technologies.
In contrast, patent indicators using patent descriptions and lexical contents are less common in patent analysis. A keyword-based approach using frequency and co-occurrence of contents is typically used for computing the technological similarity between pairs of patents [15]. Within the set of patent indicators, patent similarity is a fundamental measure in the evaluation of technological novelty [2] and infringement risks associated with others using or selling inventions without authorization [8]. Patent descriptions can be mined for combinations of words and unique expressions for text-based indicators for patent similarity. This transforms unstructured textual data into actionable knowledge through latent relationships between patent documents [16].
With the development of new and more sophisticated deep learning techniques, Natural Language Processing (NLP) tools have proven to be valid alternatives to canonical technology class measurements. The idea is to use the textual elements of patents as inputs for defining vectors of similarity. In this way, it is possible to use continuous distance measures between any two patents, e.g., Euclidean distance, cosine similarity, or Mahalanobis distance, to measure patent (dis)similarity. Although the idea of mapping patents into a vector space can be traced back to [17,18], only recently have these methods been applied to patent analysis. For example, [15] used a bag-of-words methodology [19] to develop a machine-automated patent-to-patent similarity measure based on the technical descriptions of patent applications. Adopting the same approach, [9] analyzed pairs of patent citations in the US between 1975 and 2014. Simple vocabulary-based approaches to computing textual similarity scores across citing and cited patents may suffer from major drawbacks caused by the sparsity of the output matrix. Although there have been developments to address this weakness [20], a neural network (NN) approach, such as the one proposed by [10], is preferred, as semantics and context are prioritized within the estimated positional embeddings. The introduction of language-based NN models has opened the way for more complex applications within patent similarity analyses. While early contributions focused on patent abstract data for correctly classifying patents into their technological classes [21,22], the focus is now shifting towards mapping patents into multidimensional spaces to detect patterns and gain relational insights. In this regard, [23] proposed using a K-nearest-neighbors algorithm to spot closely related patents by training a Word2Vec NN model [24] on 48 million abstracts. Regardless of the amount of data processed, the computational cost of these approaches is high.
Instead, the current availability of generic models pre-trained on massive corpora is rapidly increasing [25]. This has enabled researchers to unlock vast complex natural language models with fewer computational resources, paving the way for a new set of tools.
In the context of textual similarity analysis in patent citations, [9] and [10] noted a decrease in the average textual similarity per year between citing and cited patents. The aim of this manuscript is to investigate the drivers of the patent similarity decline over a period of approximately forty years, from 1976 to 2021, with 1976 being the year when the US Patent and Trademark Office (USPTO) started collecting the full text of all granted patents in digital databases. Previous studies of the decrease of patent similarity attribute this drop to fundamental changes that occurred in the data generation process. [26] claim that legal changes in the applicant's duty of disclosure have led to a drastic increase in the number and scope of cited references. As a consequence, more citations have been included that are further afield from the citing patent. Pursuing this hypothesis, [9] show how the skewed distribution of backward citations has become less informative for research practices, as a small minority of patent applications are now generating a large majority of patent citations in the overall citation network.
We propose to use pre-trained models to compute the embeddings. In this way, we avoid any model training and instead propose a ready-to-use approach for computing similarity scores. We focus on patent abstracts, which contain the most concise information regarding the patented technology [23,27]. Thanks to the reduced size of the abstract corpus, we are able to compute the positional embeddings via a pre-trained SBERT model in a reasonable amount of time. We encode the entire set of roughly 10 million abstracts into fixed-size vectors and compute the vector of similarity scores across 100 million patent citations through a parallelized lazy loading scheme.
We will first describe the USPTO patent data on which we base our analysis. We then describe the SBERT embedding of the abstracts data and the calculation of the patent similarity scores. The scores confirm the downward trend in the patent similarity scores. Then we propose various generalized additive models with the aim of detecting the drivers of patent similarity over time, in particular, whether this is a temporal endogenous process or due to exogenous patent attributes. In contrast to previous studies, our approach also aims to resolve the problem of the temporal boundary of the citation network by considering the time lag between the citing and the cited patents.

USPTO patent data
Intellectual property history can be traced back to the 19th century when the first patenting office was established in Paris. Since then, patenting documentation has evolved and the availability of patent data has grown dramatically. One of the main challenges of analyzing patent data is retrieving the required information from the large amount available. Moreover, patents are legal documents, mostly consisting of textual elements. Unfortunately, the lack of standardized patent formats through the years has caused difficulties in building standardized databases. Moreover, the juridical procedures of patenting are country-specific. This creates inconsistencies in the data from different countries, as some patenting offices use different citation procedures. A striking example is the difference in the citation process between the USPTO and the European Patent Office (EPO). Both the USPTO and the EPO require applicants to fulfill their duty of disclosure by citing all the required prior art. The examiner committee of the USPTO adds citations to the application by integrating all the prior art that is considered relevant for the patent to be correctly disclosed. On the other hand, the EPO examiner committee does not include any further citations in the examination process. The committee limits its range of action to evaluating the validity of the patent combined with the disclosed prior art. From this perspective, a combined analysis of data from multiple patenting offices would result in unreliable conclusions.
For this reason, we focus our analysis exclusively on patents that were issued by the USPTO from January 1976 up to September 2021. Starting from 1976, the USPTO has created an online public repository storing all the issued patents, including guidelines for data quality and standardization in the textual component of submitted legal documents. Although the USPTO data are broadly available across different periods, we have noted that the most common repositories contain many inaccuracies. Such issues are usually the result of heavy preprocessing procedures used to combine, correct, or fill missing values from distinct sources to integrate the range of data that the USPTO provides publicly. To retain the highest quality possible in our dataset, we avoid third-party preprocessing and download data directly from the USPTO digital repository (https://bulkdata.uspto.gov/). After downloading the required XML files, we processed and combined them into CSV files using an open-source software tool (available at: https://github.com/iamlemec/fastpat).
Our dataset consists of a time-stamped citation network along with patent attributes. For each granted patent we consider its backward citations, and for each patent in the dataset we include International Patent Classification (IPC) codes. In line with the network science vocabulary, we refer to citing patents as senders and to cited patents as receivers.

The unreliability of institutional classification schemes
Patent classification schemes such as the IPC, illustrated in Table 1, are designed to ease the examination of patent applications by letting examiners rapidly search for similar or related technologies. Studies on innovation use such instruments to analyze potential technological patterns, usually through similarity levels derived from co-class proximity measures [14].
It has been argued that institutional classification schemes do not offer a reliable picture of patent similarity. [15] explain how many sources of bias may emerge when comparing patents through the technological classes they belong to. On the one hand, patent classes are not fixed, i.e., new technological classes may be created and old classes may be merged, split, and/or reassigned in a way that affects the depth of technological spaces. On the other hand, the classes may be too broad or too narrow, leading to inaccurate comparisons.
We compare a random sample of 1 million citations through their sections and sub-classes as defined in the IPC classification, where sections are the broader category and sub-classes are the preferred level of analysis in empirical applications. Fig 1 clearly shows that any measure of patent similarity based on institutional classifications suffers from a selection bias in the hierarchy of classification layers. In fact, while technology classes tend to self-cite, which produces higher similarity scores, technology sub-classes tend to cite outside of their area.

Patent similarity based on pre-trained SBERT
NLP tools can be used to interpret text and translate it into a mathematical form so that other algorithms can accomplish predefined tasks. A revolution for NLP was the introduction of the Transformer architecture [28] in a field that was previously dominated by Recurrent Neural Networks and Long Short-Term Memory networks. The great step that made Transformers the new go-to tools for NLP is the focus on attention mechanisms [29], which replaced recurrence functions with a large number of parameters. Instead of processing the sentence sequentially (or word-by-word), the attention mechanism processes the entire sequence to give weights to the input. In this sense, it decides how strongly each word in the input is associated with the sentence. In this way, it runs a probabilistic-like approach that prioritizes certain parts over others. Transformers then combine this with an encoder-decoder architecture that relies solely on the attention mechanism to forward more parts of the input sequence at once (see [30] for an overview).
The Bidirectional Encoder Representations from Transformers (BERT) model [31] takes this concept and extends it by using the context coming from both sides of the part of the input currently being analyzed. This change is significant, as a word may often change meaning while the sentence develops. Each word added augments the overall meaning of the word being analyzed. The more words that are present in total in each input sequence, the more ambiguous the word in focus becomes. BERT accounts for the augmented meaning by reading the input bidirectionally, accounting for the effect of all other words in the input on the focus word and eliminating the left-to-right shift that biases words towards a certain meaning as the sentence progresses.
Although BERT outperformed the benchmarks set by previous NLP tools for encoding the meanings of words into queries, it does not perform well when it comes to comparing similarities of entire sentences. A large disadvantage of the BERT network structure as presented by [31] is that independent sentence embeddings are not computed, which makes it difficult to derive sentence embeddings from BERT. In response, [32] modified the standard BERT architecture for semantic textual similarity, yielding Sentence-BERT (or SBERT), while also reducing computing time. The main difference from the regular BERT architecture is that SBERT encodes the semantic meaning of whole sentences instead of individual words. The SBERT architecture is characterized by a so-called twin network, which allows it to process two sentences simultaneously. The twins are identical down to every parameter, which allows one to think of this architecture as a single model used multiple times. At the end of the SBERT pipeline, the model contains a final pooling layer that enables the creation of a fixed-size representation for input sentences of varying lengths. With this, it is possible to encode documents into fixed-size vectors, while taking their semantics into account.
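The role of the final pooling layer can be illustrated with a minimal NumPy sketch. It assumes mean pooling over non-padding tokens, which is one common choice; the pooling actually used by the model in this paper is not specified here, and the function name `mean_pool` and the toy inputs are purely illustrative.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) positions.

    token_embeddings: array of shape (batch, tokens, dim)
    attention_mask:   array of shape (batch, tokens), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two toy "sentences" of different lengths (3 and 5 tokens), padded to 5 tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 384))
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])
sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors.shape)
```

Whatever the input length, each sentence ends up as one vector of the same dimension, which is what makes pairwise comparison of abstracts straightforward.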
The downside of models based on Transformer architectures is that these are among the most computationally intensive neural networks to train. The peer-to-peer Hugging Face repository addresses this deficiency by allowing researchers to upload trained models in an open-source fashion. With this tool, deep and complex neural networks are within the reach of every user. Moreover, pre-trained SBERT models have two other advantages over competing models. The first is that pre-trained SBERT models uploaded on Hugging Face are trained either for specific or for general purposes. General-purpose models are trained on billions of generic documents, which grants the flexibility to use SBERT for any task. The second is the ease of use granted by the Sentence Transformers package, which reduces the procedure of creating the model and downloading the pre-trained weights from Hugging Face to a few lines of code. These two reasons, combined with the established benchmarks that SBERT has set in the field of NLP, make this the go-to model for our task. The Sentence Transformers package in Python gives access to pre-trained models from the Hugging Face repository that encode sequences into fixed-size vectors. From this library, we downloaded a model, trained and fine-tuned on more than one billion public documents, that encodes texts into vectors of size 384.
Similarly to [10] and [27], we removed non-utility patents (such as plants or designs) from our data when computing embeddings. In this way, we encoded approximately 7.5 million patents into a fixed-size space through SBERT. By parallelizing a scheme of lazy loading procedures, we managed to compute the patent similarity scores for almost 100 million patent citations within minutes. Confirming results from previous studies, Fig 2 shows that the average similarity per year between citing and cited patents is decreasing over time. The cosine similarity scores range between -1 and 1; we multiplied them by 100 for ease of representation.
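A minimal sketch of how pairwise scores could be computed chunk by chunk is given below. It is not the authors' implementation: the function name `citation_similarities`, the index arrays `senders` and `receivers`, and the chunk size are assumptions, and the parallelization mentioned above is omitted for brevity. Only the cosine similarity and the scaling by 100 follow the text.

```python
import numpy as np

def citation_similarities(embeddings, senders, receivers, chunk_size=1_000_000):
    """Yield cosine similarity x 100 for citation pairs, one chunk at a time.

    embeddings: (n_patents, dim) array of abstract embeddings
    senders, receivers: integer row indices of the citing and cited patents
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # unit-length rows
    for start in range(0, len(senders), chunk_size):
        s = unit[senders[start:start + chunk_size]]
        r = unit[receivers[start:start + chunk_size]]
        # Row-wise dot product of unit vectors equals the cosine similarity.
        yield 100.0 * np.einsum("ij,ij->i", s, r)

# Toy example: three patents in a 2-dimensional space, three citations.
embeddings = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
senders = np.array([2, 2, 1])
receivers = np.array([0, 1, 0])
scores = np.concatenate(
    list(citation_similarities(embeddings, senders, receivers, chunk_size=2))
)
print(scores)
```

Because the generator only materializes one chunk of pairs at a time, memory use is bounded by the chunk size rather than by the total number of citations.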
Embeddings were computed in parallel on a cluster node with two NVIDIA GeForce GTX 1080 Ti graphics processing units, each with 3584 CUDA cores, and 100 GB of RAM. Computation time was within one hour (more information on the resources that were used can be found at https://intranet.ics.usi.ch/HPC). Thanks to the usage of a pre-trained architecture, there is no training involved in the computation of the embeddings. The machine resources are therefore used only to take the abstracts as input and perform a single forward pass through the neural network. Given this simplified procedure, the same results can be obtained within a reasonable time using less powerful computational resources, e.g., standard Colab notebooks.

Modeling similarity scores through GAMs
It has been claimed that, due to the changes in the generative process of patent citations, these citations have become less informative and representative [9]. We argue instead that, with the correct application of informative statistical models, it is still possible to gain important insights into the main drivers of the decrease of patent similarity. As such, we argue that the backward citation process still plays a major role in determining the technological proximity of patents and the direction in which the network of citations is expanding. We propose to model textual similarity scores through GAMs by extending the approach of [9]. GAMs can be used to estimate non-linear effects of covariates on the dependent variable. In more detail, while in linear models the predictor is a weighted sum of the p covariates, $\sum_{j=1}^{p} \beta_j x_j$, in GAMs this term is replaced by a sum of functions, e.g., $\sum_{j=1}^{p} f_j(x_j)$, where each $f_j$ is a smooth function of the corresponding covariate.
Model 0. Using this modeling technique, the decrease of patent similarity can be visualized by a simple GAM with the patent publication date as the only covariate, modeled by a smooth term (Model 0).
Model 1. The average decrease of similarity in backward citations is associated with an increase in the average temporal lag between citing and cited patents (see Fig 3). This result seems to suggest that applicants and examiners cite prior art that is increasingly temporally distant from their application/grant time. By itself, this effect could be a source of the reduction of similarity levels, as the innovation process gives reasons to believe that temporally distant technologies are less similar to newer ones. In addition, over a period of approximately forty years, the legal and technical language has seen some important changes. Although the usage of SBERT should mitigate the change of language, as the model accounts for context and semantics, the language evolution follows the technological development present inside patents, thus amplifying the reduction in similarity.

PLOS ONE
In our model specification, the temporal component of patent citations is captured by the covariate temporal difference. Its effect on patent similarity is modeled through a smoothing spline of the time lag (in days) between the issue dates of the cited and citing patents. Together with the sender publication date, this accounts for the two fundamental temporal components that influence the similarity levels (Model 1).
Model 2. [26] argued that legal changes that occurred in the early 2000s amplified the incentives to disclose, increasing the number of citations per issued patent. This idea is also discussed in [9], where a negative effect of the number of backward citations was observed. However, the inflation of the number of citations in the most recent period may introduce a bias that a linear effect does not properly capture. We address this issue by adding to the model a further covariate fitted through a smooth term: the backward citation count. With this explanatory variable, we correct for the increasing number of backward citations made by a given citing party. Furthermore, we consider the type of applicant who is providing the citation, discriminating between organizations (both profit and non-profit) and private owners. The reason for this is straightforward: if there is an inflation in the number of citations, it could be more pronounced among organizations than among private applicants, as the former tend to cite more on average. We added three distinct fixed effects in the form of binary variables: is the same organization, if the owners of the citing and the cited patent coincide; and is citing party an organization and is cited party an organization, if the owner of the citing or of the cited patent, respectively, is an organization (Model 2).
Model 3. To complete our analysis, we introduced effects related to the IPC for both the citing and the cited party.
As we explained, the usage of technological classes for the assessment of patent similarity is disputable [15], but still very important as a common source of knowledge for applicants and examiners. Following [14], for each pair of citing/cited patents we computed the Jaccard index of individual components in the IPC scheme. What we obtained are five distinct distributions that summarize the technological relatedness between two patents at different levels of the hierarchical classification system.
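The level-wise Jaccard computation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes IPC symbols formatted like `H04L 9/32`, takes the five levels to be section, class, subclass, group, and subgroup, and uses the hypothetical helper names `ipc_levels` and `ipc_jaccard`.

```python
def ipc_levels(code):
    """Split an IPC symbol such as 'H04L 9/32' into five hierarchical prefixes.

    Assumes the 'SUBCLASS GROUP/SUBGROUP' layout; other layouts need other parsing.
    """
    subclass, _, group_part = code.strip().partition(" ")
    main_group = group_part.split("/")[0]
    return {
        "section": subclass[:1],
        "class": subclass[:3],
        "subclass": subclass[:4],
        "group": f"{subclass} {main_group}",
        "subgroup": f"{subclass} {group_part}",
    }

def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ipc_jaccard(citing_codes, cited_codes):
    """Return one Jaccard score per IPC level for a citing/cited pair."""
    scores = {}
    for level in ("section", "class", "subclass", "group", "subgroup"):
        citing = {ipc_levels(c)[level] for c in citing_codes}
        cited = {ipc_levels(c)[level] for c in cited_codes}
        scores[level] = jaccard(citing, cited)
    return scores

# Toy pair: the patents share a section, class, subclass, and main group, but no subgroup.
scores = ipc_jaccard(["H04L 9/32", "G06F 21/60"], ["H04L 9/08", "H04W 12/06"])
print(scores)
```

Applying such a function to every citing/cited pair gives one score per level, and collecting these over all pairs yields one distribution per level of the hierarchy.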

Results
The naive Model 0 is in line with previous studies, suggesting that patent similarity is decreasing with time (Fig 4). However, this view fails to take into account various confounding effects. In turn, we will focus our attention on the time lag between the citing patent and the patents it cites (Model 1), exogenous information associated with citing and cited patent owners and the increasing number of citations per patent (Model 2), and finally the IPC classification of the patents (Model 3). Empirical results are reported in Table 2. For each model specification, Fig 4 compares the estimated splines associated with each effect.
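For orientation, the four specifications can be sketched in the following notation, which is an assumption introduced here and not taken from the paper: $s_{ij}$ is the similarity score of the citation from sender $i$ to receiver $j$, $t_i$ and $t_j$ are their publication dates, $c_i$ is the sender's backward citation count, $\mathbf{z}_{ij}$ collects the three binary ownership indicators, and $J^{(k)}_{ij}$ is the Jaccard index at the $k$-th IPC level. An identity link is assumed, Model 3 is assumed to retain all terms of Model 2, and the functional form $h_k$ of the Jaccard effects is left open.

```latex
\begin{align*}
\text{Model 0:}\quad & \mathbb{E}[s_{ij}] = \beta_0 + f_1(t_i) \\
\text{Model 1:}\quad & \mathbb{E}[s_{ij}] = \beta_0 + f_1(t_i) + f_2(\Delta_{ij}),
    \qquad \Delta_{ij} = t_i - t_j \\
\text{Model 2:}\quad & \mathbb{E}[s_{ij}] = \beta_0 + f_1(t_i) + f_2(\Delta_{ij})
    + f_3(\log c_i) + \boldsymbol{\gamma}^{\top}\mathbf{z}_{ij} \\
\text{Model 3:}\quad & \mathbb{E}[s_{ij}] = \beta_0 + f_1(t_i) + f_2(\Delta_{ij})
    + f_3(\log c_i) + \boldsymbol{\gamma}^{\top}\mathbf{z}_{ij}
    + \sum_{k=1}^{5} h_k\bigl(J^{(k)}_{ij}\bigr)
\end{align*}
```

Under this reading, comparing the estimated $f_1$ across the four models shows how the publication-date trend changes as each group of confounders is added.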

Temporal effects
Model 1 shows that the larger the temporal lag of a citation, the lower the similarity between the citing and the cited patent. Correcting for temporal lag when studying patent similarity is important because it accounts for the temporal boundary of the patent citation network, i.e., similarity information is available only for pairs of patents issued after 1976. Related studies [9] do not apply such a correction, with the consequence that their results may be biased and may not reflect the actual changes in similarity patterns. With this correction in place, the results show that patent similarity levels only start to decrease at the beginning of the 1990s, after an initial increasing trend.

Citation effects
Model 2 reveals a downward trend in the effect of the log count of sender citations. This suggests that the higher the number of patent citations, the lower the pairwise similarity between citing and cited patents. Accounting for the inflation of citations also mitigates the decline of patent similarity since the 1990s.
Model 2 shows that citations in which patents are owned by an organization tend to have lower similarity levels. This seems to suggest that organizations tend to include in their patents a series of citations that are only loosely related to their own technology. On the other hand, citations between patents that are owned by the same organization have an important positive impact on similarity. This may be a side-effect of the fact that, within organizations, the office responsible for filing patent applications may be the same. As such, words and technical descriptions may be coming from the same group of authors. Moreover, organizations involved in patent-intensive industries will likely tend to cite their own previous patents.
Fig 4. https://doi.org/10.1371/journal.pone.0283247.g004

Class effects
The importance of including technology-related information through the Jaccard similarity of IPC codes is highlighted by the significant increase in deviance explained and the reduction of the Generalized Cross-Validation (GCV) score. Empirical estimates of Model 3 show the important effect of the lower levels of the patent classification scheme. The Jaccard score computed at the IPC subgroup level has the strongest impact on pairwise patent similarity. This should be expected given the complex structure of the IPC framework. Patents that share the lowest branch of the classification system will likely include technologies with similar features and consequently similar textual and semantic components.
Finally, when controlling for technology similarity through Jaccard indices, the spline associated with publication date in Model 3 inverts its downward trend in 2011 and progressively increases up to 2021. This is an important result, as it suggests that after accounting for important confounders, the similarity between patents is not decreasing at all, but in fact may recently be slowly increasing.

Discussion
Patent similarity is a complex concept for which only proxy measures exist. Early approaches focused on classification-based measures, whereas more recently text-based similarity measures were introduced. In this paper, we focused our attention on a similarity measure based on SBERT, a direct evolution of the well-known BERT. Nevertheless, the field of sentence embeddings is in constant development, which keeps raising the bar in terms of model performance. A competitive alternative to SBERT could be the newly released "Definition Sentences" (DefSent, [34]) model, whose performance has proven to be marginally higher [35]. However, at the present time, there are not enough pre-trained models on which we can provide a fair comparison. Furthermore, although training other language models on the patent corpus [10,23] could potentially improve the accuracy of the patent similarity score, it may not change our results. In fact, although we have argued that the SBERT-based measure used in this work is a state-of-the-art approach, it is important that the main conclusions presented in this manuscript do not depend exclusively on this way of calculating patent similarity. In parallel, we repeated our analysis with the patent similarity data used by [9], which led to substantially the same results as presented in this manuscript.
When studying real-world networks, the issue of network boundaries is almost unavoidable. In the study of patent citation networks, two important boundaries we encountered in our analysis are that the citing patents are US-based and that they were issued after 1976. These temporal and spatial boundaries mean, first, that our concrete conclusions are firmly restricted to the modern US reality. More general conclusions would be extrapolations with more or less empirical support. It would be interesting to repeat the study in alternative jurisdictions.
Secondly, the temporal boundary in particular has an important effect on the main conclusion presented in this manuscript. The sharp decline of raw patent similarity since 1976 is only reversed to an effective increase in similarity if we accept the hypothesis that patent citation lag is an effective way to account for the temporal boundary in the patent citation network. A longer observation period would allow us to test this assumption more carefully.

Conclusion
Text-based similarity measures are among the most widely used indicators of patent relatedness. In this paper, we propose an efficient way to compute textual similarity scores using patent abstracts instead of entire technical descriptions. The measure we used shows a pattern similar to those of related studies, namely a decrease in text-based similarity starting in 1976 and continuing until recent times. The disadvantage of previous techniques is that they typically involve computationally intensive procedures that hinder replication. In contrast, the approach of this paper avoids computational bottlenecks by making use of pre-trained NNs.
Although the changes in the legal framework have had consequences on the citation process, there are other components responsible for the apparent decrease in patent similarity. Simplistic model formulations have obscured the true effect of various factors on the trend in patent similarities. Our empirical analysis focuses on the large body of patent similarities and uses a generalized additive model (GAM) [33] to uncover the non-linear relationship between several drivers on the one hand and patent similarities on the other. Explanatory variables in the model specification include both patent-related attributes and network-based measures of technological novelty. Using several model specifications, our analyses show that the observed downward trend of patent similarity scores in the last forty years is the result of several distinct endogenous effects.
The main contribution of this work concerns the analysis of possible factors that are generating the observed downward trend in patent similarity. By means of multiple GAMs, we modeled a combination of fixed and non-linear effects. What emerged from the empirical analysis is that the trend in patent similarity is affected by a series of phenomena. The most important one is the effect of the time lag between citing and cited patents. This can be seen from the transition between Model 0 and Model 1, in which we account for the time difference between the publication dates of sender and receiver. With this effect in place, the curve changes its shape and shows an interesting increase up to the mid-80s. Finally, the introduction of citation and class effects further corrects the curve by considering two other well-known phenomena: the tendency to increase the number of citations per patent application and the increase in patent class assignments. Thanks to these adjustments, we conclude that the levels of similarity have not been constantly decreasing since 1976, but instead show a more oscillating behavior.